Prevent allocation of currently allocated resources #1046

Merged: 2 commits merged into flux-framework:master from milroy:alloc-fix on Jul 11, 2023

Conversation

@milroy (Member) commented Jul 8, 2023

Problem: issue #1043 identified a scenario where Fluxion will grant a new allocation to a job while the resources are still occupied by the previous allocation. The double booking occurs because Fluxion assumes a job will not run beyond its walltime. However, as the issue describes, an epilog script may cause a job to run beyond its walltime. Since Fluxion doesn't receive a free message until the epilog completes, the allocation remains in the resource graph, but the scheduled point at the allocation's completion time is exceeded, allowing the resources to be allocated to another job. There are other common scenarios that can lead to multiple concurrent allocations, such as a job getting stuck in CLEANUP.

This PR adds a check for an existing allocation on each exclusive resource vertex during traversal pruning. The check occurs only for AT_ALLOC traversals, which prevents another job from receiving the resources while still allowing reservation and satisfiability traversals to succeed. AT_ALLOC traversals are blocked from matching the resources until the jobid is removed from the per-vertex allocations map by a free message from the job-manager.
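For intuition, here is a minimal, self-contained sketch of that rule (a toy model, not the PR's code: alloc_type_t and vertex_t below are simplified stand-ins for Fluxion's jobmeta and vertex types, and the real check operates on (*m_graph)[u].schedule.allocations during pruning):

#include <cstdint>
#include <iostream>
#include <map>

// Traversal types, modeled after the names used in this thread.
enum class alloc_type_t { AT_ALLOC, AT_ALLOC_ORELSE_RESERVE, AT_SATISFIABILITY };

struct vertex_t {
    // jobid -> allocated count; an exclusive vertex holds at most one entry
    std::map<int64_t, int64_t> allocations;
};

// Return true if the traversal must be pruned at this vertex. The entry
// is only erased from allocations when job-manager sends a free message
// for the owning jobid, so a second AT_ALLOC traversal cannot match
// these resources until then.
bool prune_existing_alloc (alloc_type_t type, const vertex_t &v)
{
    return type == alloc_type_t::AT_ALLOC && !v.allocations.empty ();
}

int main ()
{
    vertex_t busy;
    busy.allocations[42] = 1; // job 42 still occupies this vertex
    std::cout << prune_existing_alloc (alloc_type_t::AT_ALLOC, busy)                // 1: pruned
              << prune_existing_alloc (alloc_type_t::AT_ALLOC_ORELSE_RESERVE, busy) // 0: reservation allowed
              << prune_existing_alloc (alloc_type_t::AT_SATISFIABILITY, busy)       // 0: satisfiability allowed
              << "\n";
    return 0;
}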

The check has an (I think) unavoidable side effect: reservations on the allocated subgraph will be allowed immediately after the at time of the AT_ALLOC traversal, because the planner_multi_avail_time_first call will regard extra resources as available due to incorrect aggregate counts. The side effect is minor, though, as the reservation will be blocked from transitioning to allocated until a free message cancels the existing allocation. During subsequent scheduling loops, if the AT_ALLOC traversal fails, the reservation will be pushed back during the AT_ALLOC_ORELSE_RESERVE traversal. That will continue until matching resources become available.
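To make the push-back behavior concrete, here is a toy simulation (the times, the 60-second step, and the loop structure are all illustrative, not Fluxion's actual scheduling loop) of a reservation being deferred each loop until a free message clears the previous allocation:

#include <cstdint>
#include <iostream>

int main ()
{
    bool vertex_occupied = true; // previous job has not sent free yet
    int64_t reserved_at = 3600;  // optimistic reservation time from aggregate counts

    for (int loop = 0; loop < 4; loop++) {
        if (loop == 2)
            vertex_occupied = false; // free message cancels the old allocation
        if (vertex_occupied) {
            // AT_ALLOC fails at the occupied vertex, so the
            // AT_ALLOC_ORELSE_RESERVE traversal pushes the reservation back.
            reserved_at += 60;
            std::cout << "loop " << loop << ": reservation pushed back to t=" << reserved_at << "\n";
        } else {
            std::cout << "loop " << loop << ": allocated at t=" << reserved_at << "\n";
            break;
        }
    }
    return 0;
}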

The PR fixes the reproducer from flux-framework/flux-core#5304 and passes the test in PR #1044, but it's WIP because the behavior for non-exclusive resource requests is still suspect.

I chose to place the check in the by_excl part of prune because there can be only a single allocation in (*m_graph)[u].schedule.allocations for an exclusive vertex. Checking at non-exclusively allocated vertices is possible, but the logic is much more complex: it would have to determine which jobid is delayed by comparing the availability counts returned by planner with the jobs in the (*m_graph)[u].schedule.allocations map, and then determine the true number of available resources.

Checking only at exclusive vertices can involve a deeper traversal than necessary (compared with stopping at higher, non-exclusive vertices), but I think it will correctly block allocations provided that leaf-level vertices must be allocated exclusively. That appears to be ensured here:

if (exclusive_in && resource.exclusive == Jobspec::tristate_t::FALSE) {
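As a toy illustration of that guard (excl_conflict is a hypothetical helper, not Fluxion's API; it mirrors the quoted condition), a resource requested with exclusive: false conflicts as soon as the walk has entered its subtree exclusively, which is what forces everything under an exclusive slot to be exclusive:

#include <iostream>

// Simplified stand-in for Jobspec::tristate_t.
enum class tristate_t { FALSE, TRUE, UNSPECIFIED };

// Hypothetical helper mirroring the quoted guard: once the walk is
// exclusive (exclusive_in), a child requested with exclusive: false
// conflicts and the match fails.
bool excl_conflict (bool exclusive_in, tristate_t requested)
{
    return exclusive_in && requested == tristate_t::FALSE;
}

int main ()
{
    std::cout << excl_conflict (true, tristate_t::FALSE)  // 1: conflict under an exclusive slot
              << excl_conflict (false, tristate_t::FALSE) // 0: fine outside an exclusive subtree
              << "\n";
    return 0;
}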

There are two items to verify, plus one testsuite update, before the WIP can be dropped:

  • Make sure leaf-level vertices must be allocated exclusively
  • Check the behavior for upwalks into an auxiliary subsystem
  • Change test_expect_failure to test_expect_success in t/t1024-alloc-check.t after PR #1044 (testsuite: add test for double-booking) is merged

This PR should be merged after PR #1044.

@milroy milroy changed the title from "[WIP] traverser: prevent allocation of currently allocated resources" to "[WIP] Prevent allocation of currently allocated resources" on Jul 8, 2023

@garlick (Member) commented Jul 8, 2023

Thanks @milroy for taking the time to write up the explanation.

#1044 has been merged, so at some point this PR can be rebased on master and a commit added that changes the two test_expect_failure tests in t1024-alloc-check.t to test_expect_success.

@grondo (Contributor) commented Jul 10, 2023

Just checking in here since the deadline for new packages for the next TOSS release is tomorrow. I think the CI failures are just because the expect-failure alloc-check tests need to be updated to expect success:

# 2 known breakage(s) vanished; please update test(s)

Happy to push an update here if that would help @milroy.

@milroy (Member, Author) commented Jul 10, 2023

@grondo: just got out of meetings; I rebased and pushed the update to the testsuite. I'm going to check the two remaining items in the description and drop the WIP tag if possible.

@milroy (Member, Author) commented Jul 11, 2023

I've confirmed that the exclusivity check here:

if (exclusive_in && resource.exclusive == Jobspec::tristate_t::FALSE) {

ensures that everything under slot must be allocated exclusively. For example, the following simple jobspec:

version: 9999
resources:
  - type: cluster
    count: 1
    with:
      - type: rack
        count: 1
        with:
          - type: node
            count: 1
            with:
              - type: socket
                count: 1
                with:
                  - type: slot
                    label: corelevel
                    count: 1
                    with:
                      - type: core
                        exclusive: false
                        count: 1

# a comment
attributes:
  system:
    duration: 3600
tasks:
  - command: [ "app" ]
    slot: corelevel
    count:
      per_slot: 1

Produces the following match error (overly verbose; we'll need to fix that):

resource-query> m allocate t/data/resource/jobspecs/exclusive/test001.yaml
ERROR: by_excl: exclusivity conflicts at jobspec= : vertex=core0by_excl: exclusivity conflicts at jobspec= : vertex=core1
[...]

From that standpoint this PR should work as intended. However, is Fluxion conforming to RFC 14 with the jobspec above? In particular, here: https://flux-framework.readthedocs.io/projects/flux-rfc/en/latest/spec_14.html#reserved-resource-types, which states:

A task slot SHALL have at least one edge specified using with:, and the resources associated with a slot SHALL be exclusively allocated to the program described in the jobspec, unless otherwise specified in the exclusive field of the associated resource.

The last sentence implies that exclusive: false at a leaf vertex should be supported by Fluxion. @grondo and @garlick, is my assessment correct?

@garlick (Member) commented Jul 11, 2023

Honestly I do not know. We don't have command line tooling that can express that AFAIK. Maybe we could open an issue on that one for later?

@milroy (Member, Author) commented Jul 11, 2023

Maybe we could open an issue on that one for later?

My concern is that if it's decided that Fluxion needs to support requests like the one above, then the fix in this PR won't work.

@garlick (Member) commented Jul 11, 2023

I could be mistaken but I think jobspec v1 would forbid that type of slot definition. See https://flux-framework.readthedocs.io/projects/flux-rfc/en/latest/spec_25.html#v1-specific-resource-graph-restrictions. flux-core ensures that only v1 compliant jobspec can be submitted to the scheduler, so...is that sufficient justification for deferring this concern?

@milroy (Member, Author) commented Jul 11, 2023

It looks like the intention was to make an exclusive: false request under slot invalid in RFC 25 (flux-framework/rfc#189 and 3552b2c#r309027939), but I don't see that it ever happened.

@milroy (Member, Author) commented Jul 11, 2023

so...is that sufficient justification for deferring this concern?

Yes, it is. I was a bit confused by the Resource Graph Restrictions section, but I think that, together with:

a node type resource vertex MAY also contain the following optional keys:
exclusive

(implying that no other vertex types can contain that key), it is strong enough.

Problem: issue flux-framework#1043 identified a scenario where
Fluxion will grant a new allocation to a job while the
resources are still occupied by the previous allocation.
The double booking occurs due to the assumption
Fluxion makes that a job will not run beyond its
walltime. However, as the issue describes, an epilog
script may cause a job to run beyond its walltime. Since
Fluxion doesn't receive a `free` message until the epilog
completes, the allocation remains in the resource graph
but the scheduled point at allocation completion is
exceeded, allowing the resources to be allocated
to another job. There are other common scenarios that can
lead to multiple concurrent allocations, such as a job
getting stuck in CLEANUP.

Add a check for an existing allocation on each exclusive
resource vertex for allocation traversals during graph
traversal pruning. This prevents another job from receiving
the resources and allows reservations and satisfiability
checks to complete.

Problem: flux-framework#1044 introduced tests to check for
allocation failures due to jobs that exceed their time limit.
PR flux-framework#1044 had to define the two `alloc-check`
tests as `test_expect_failure` because the mechanism to
prevent multiple booking was not in place.

Update the two `alloc-check` tests to `test_expect_success`
since the PR containing this commit fixes multiple booking
in Fluxion.

codecov bot commented Jul 11, 2023

Codecov Report

Merging #1046 (c961577) into master (506c0c4) will increase coverage by 0.0%.
The diff coverage is 100.0%.

❗ Current head c961577 differs from pull request most recent head d0716cc. Consider uploading reports for the commit d0716cc to get more accurate results

@@          Coverage Diff           @@
##           master   #1046   +/-   ##
======================================
  Coverage    74.3%   74.4%           
======================================
  Files          86      86           
  Lines        9432    9434    +2     
======================================
+ Hits         7017    7022    +5     
+ Misses       2415    2412    -3     
Impacted Files                      Coverage Δ
resource/traversers/dfu_impl.cpp    85.0% <100.0%> (+<0.1%) ⬆️

... and 1 file with indirect coverage changes

@milroy (Member, Author) commented Jul 11, 2023

After inspecting the auxiliary subsystem traversals and testing the fix in PR #1041, I'm satisfied that this fix will likely work with (and shouldn't break) the aux traversals.

Removing WIP and requesting an additional review.

@milroy milroy changed the title from "[WIP] Prevent allocation of currently allocated resources" to "Prevent allocation of currently allocated resources" on Jul 11, 2023
@milroy milroy requested review from garlick and trws July 11, 2023 07:17

@garlick (Member) left a comment

LGTM!

Just gave this a quick sanity check on my test cluster with the artificially extended prolog/epilog and it seems to work well. (Not too different from the sharness test so maybe that doesn't add much).

@grondo (Contributor) left a comment

LGTM too! Perhaps @jameshcorbett should add his review in case there is something here that might affect rabbit allocations.

@trws (Member) commented Jul 11, 2023

Yeah, this looks good. We may end up deciding to tweak it as part of the traversal refactor I want to do, but it should be easier to express after that too. Really good idea @milroy!

@jameshcorbett (Member) left a comment

Looks reasonable to me!

@milroy (Member, Author) commented Jul 11, 2023

Thanks @milroy for taking the time to write up the explanation.

No problem @garlick, and let me know if additional details would help. I didn't include a discussion of why, in planner-like logic, handling early termination is much easier than handling jobs that run longer than their walltime, because the change here sidesteps that difficulty.

Let me know if I should configure this PR to close issue #1043.

@milroy (Member, Author) commented Jul 11, 2023

Thanks all for the discussion and reviews! Setting MWP.

@milroy milroy added the merge-when-passing label (mergify.io - merge PR automatically once CI passes) on Jul 11, 2023
@mergify mergify bot merged commit bfc044e into flux-framework:master Jul 11, 2023
@milroy milroy deleted the alloc-fix branch July 11, 2023 16:50
@milroy milroy mentioned this pull request Jul 13, 2023